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Abstract 

A lattice gauge theory framework for simulations on graphic processing units 
(GPUs) using NVIDIA's CUDA is presented. The code comprises template 
classes that take care of an optimal data pattern to ensure coalesced read- 
ing from device memory to achieve maximum performance. In this work 
we concentrate on applications for lattice gauge fixing in 3+1 dimensional 
SU(3) lattice gauge field theories. We employ the overrelaxation, stochastic 
relaxation and simulated annealing algorithms which are perfectly suited to 
be accelerated by highly parallel architectures like GPUs. The applications 
support the Coulomb, Landau and maximally Abelian gauges. Moreover, we 
explore the evolution of the numerical accuracy of the SU(3) valued degrees 
of freedom over the runtime of the algorithms in single (SP) and double pre- 
cision (DP). Therefrom we draw conclusions on the reliability of SP and DP 
simulations and suggest a mixed precision scheme that performs the critical 
parts of the algorithm in full DP while retaining 80-90% of the SP perfor- 
mance. Finally, multi-GPUs are adopted to overcome the memory constraint 
of single GPUs. A communicator class which hides the MPI data exchange 
at the boundaries of the lattice domains, via the low bandwidth PCI-Bus, 
effectively behind calculations in the inner part of the domain is presented. 
Linear scaling using 16 NVIDIA Tesla C2070 devices and a maximum perfor- 
mance of 3.5 Teraflops on lattices of size down to 64 3 x 256 is demonstrated. 
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1. Introduction 

Quantum Chromodynamics (QCD) is nowadays, 40 years after its birth, 
widely accepted as the correct theory of the strong nuclear force which binds 
the protons and neutrons in the cores of atoms. The guiding principle in 
the construction of QCD was the local gauge symmetry which has led be- 
fore to the very successful theory of quantum electrodynamics (QED) that 
describes the interactions of electrons and light. Local gauge symmetry is 
the freedom to perform a transformation of the vector fields of the theory, 
independently at each point of space-time, without changing the physics the 
theory describes. 

Lattice QCD which lives on a discretized space-time background opposed 
to the continuous world of the original theory, offers a formulation of the 
gauge theory that is well suited to be simulated on a computer and hence 
can be used to test the theory against experiment. Furthermore, lattice 
simulations can help to gain insights in the highly nontrivial, nonperturbative 
regime of the interactions between quarks and gluons which are the degrees 
of freedom of QCD. 

The gauge symmetry, given below in its discrete version, states that phys- 
ical observables will remain unchanged if a local transformation of the form 



is being carried out. Here, the gauge fields or link variables U^(x) as well as 
the gauge transformations g(x) are elements of the underlying gauge group 
which is SU(3) in the case of QCD. The index \i = 0, . . . , 3 refers to the 
direction in four dimensional space-time and with x + fx we denote the neigh- 
bor lattice site of x in the /i-direction. The link variables of lattice QCD are 
connected to the algebra valued continuum gauge fields A^x) via 



Whereas physical observables that can be measured in experiments must 
be independent of the gauge, fixing the gauge, i.e., choosing a particular 
gauge transformation g{x) for all x, is essential when, e.g., studying gauge 
dependent quantities like the fundamental two point functions of the theory. 




(1) 




(2) 
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As a typical example of a gauge condition that may be enforced at all 
space-time points x, we consider the manifestly covariant Landau gauge 

d^x) = 0, (3) 

here stated in the language of continuum field theories. As we will discuss in 
the next Section, the continuum gauge condition Eq. ([3]) translates to a large 
scale optimization problem in lattice QCD with 0(VN^) degrees of freedom 
where V = x N t is the 3 + 1 dimensional lattice volume. Consequently, the 
process of fixing the gauge on the lattice demands a major part of the whole 
simulation's computer time and the possible acceleration by highly parallel 
hardware architectures like graphic processing units (GPUs) will be of great 
practical use. 

A more conceptual issue of gauge fixing is that the set of gauge transfor- 
mations g(x) that fulfill a desired gauge condition is far from being unique. 
The set of gauge equivalent configurations of a given gauge field is called the 
gauge orbit. The gauge fixing condition can be depicted as a hypersurface 
living in the space of all gauge fields. Each of the multiple intersections of 
the gauge orbit with the gauge fixing hypersurface is called a Gribov copy. 

Gribov copies play a crucial role in restoring the BRST symmetry on the 
lattice: fixing a gauge via the Faddeev-Popov procedure on the lattice for 
a compact group boils down to inserting the sum over signs of the corre- 
sponding Faddeev-Popov determinants evaluated at all the Gribov copies. 
Neuberger [l[ showed that the sum for any covariant gauge turns out to be 
zero for any standard model gauge group, SU(iV), and for compact U(l), 
making the expectation value of a gauge- fixed observable 0/0. The zero 
comes up because each Gribov copy comes in pairs with opposite sign of the 
Faddeev-Popov determinant. This in turn makes it impossible to construct 
a BRST symmetry on the lattice. This is called the Neuberger 0/0 prob- 
lem. Following a topological interpretation of the Neuberger 0/0 problem, in 
Refs. @, S] a modified lattice Landau gauge was proposed which evaded the 
problem. There, because the Faddeev-Popov is shown to be strictly positive 
(semi-)definite, the cancellation is avoided. However, it is yet to be shown 
that the number of Gribov copies in the modified lattice Landau gauge is 
independent of the background gauge field. Interestingly, recently, a deep 
relation between lattice gauge fixing and lattice supersymmetry has been 
proposed in Ref. [1, 0]: the partition functions of a class of supersymmetric 
Yang-Mills theories can be viewed as a gauge fixing partition function a la 
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Faddeev-Popov and the "Gribov copies" are then nothing but the classical 
configurations of the theory. 

A possible way out of the problem of the existence of Gribov copies is 
to restrict the gauge fixing hypersurface to a region which the gauge orbit 
intersects only once. An example thereof is the so-called Fundamental Mod- 
ular Region [6| which contains only that intersection of the gauge orbit which 
corresponds to the global optimum of the gauge condition. Unfortunately, no 
algorithm is known which finds the global optimum of the gauge condition 
within finite simulation time. Simulated annealing, however, has been shown 
to highly favor optima closer to the global optimum 0, 0| arid moreover it 
can be shown that, in the limit of infinite time, simulated annealing actually 
converges to the global maximum. 

Recently, the problem of counting Gribov copies has gained a renewed 
interest. In Refs. [9[, an explicit formula of the number of Gribov copies for 
any number of lattice sites is analytically derived for lattice Landau gauge 



for the one-dimensional compact U(T) case. In Refs. 10|, 111], a novel method 



based on Algebraic Geometry |12l . Il3l . |14| . which can count all the Gribov 



copies, was proposed. Although the method has only been able to work 
for small lattices, it is the only known method which guarantees to find all 
Gribov copies and hence it can work as a benchmark for other methods. 
One such alternative method is plain brute force, i.e., running a standard 
optimization algorithm over and over again from different starting points 
on the gauge orbit and collecting the results consecutively. Clearly, a high 
performance lattice gauge fixing code is essential for this task and since here 
one primarily focuses on small lattices, GPUs are favorable given the fact that 
CPU parallelization techniques are very limited for lattices of small extent. 

In this work we present a set of applications for lattice gauge fixing based 
on the family of relaxation algorithms and simulated annealing. The appli- 
cations are based on the CUDA accelerated Lattice-Graz-Tiibineren co 
that is written in CUDA/C++ and makes heavy use of template classes in 
order to facilitate the extension to a broad variety of applications. Besides 



the standard relaxation algorithm [15j, we support overrelaxation [16[ and 



stochastic relaxation 17| to overcome the problem of critical slowing down. 
Moreover, the simulated annealing algorithm [7| with a heatbath kernel and 
microcanonical updates which increases the probability to reach the Funda- 



1 Available for download at www.cuLGT.com 
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mental Modular Region has been implemented and tested. The code can be 
used to fix gauge configurations to the covariant Landau gauge d^A^ = 0, 
\x = 0, . . . , 3, the Coulomb gauge d{Ai — 0, i — 1, 2, 3 and the maximally 
Abelian gauge. 

A first attempt of porting lattice gauge fixing with the overrelaxation 
algorithm to the GPU has been reported in 18|, [19( . An alternative approach 
based on the steepest descent method has been presented in [20| . For a more 
general discussion of lattice gauge fixing and its problems we refer the reader 



to 
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The remainder of this work is organized as follows: in Sec. |2]the optimiza- 
tion problem is stated and the algorithms of choice are presented. In Sec. [3] 
we summarize some hardware properties of the NVIDIA GPUs that we use 
for our investigation and moreover briefly discuss NVIDIA's programming 
environment CUDA. Next, in Sec. HI we give details of our implementation 
and the cuLGT framework and moreover discuss numerical accuracy issues. 
To overcome the memory constraint of single GPUs we extend our imple- 
mentation to support multi-GPUs; all details thereto are presented in Sec. EJ 
Finally, in Sec. E]we show various performance results for single and multiple 
GPUs and furthermore present some convergence results of the algorithms. 
In Sec. [7] we summarize and conclude. 



2. The algorithms 

In this Section we will first summarize the defining equations of the opti- 
mization problem. Subsequently, we discuss the various flavors of the update 
kernels and finally we list the main underlying algorithm of this work explic- 
itly in terms of pseudo-code. 

2.1. The gauge junctionals 

On the lattice, enforcing a gauge condition, e.g., Eq. ([3]) is equivalent to 
maximizing the corresponding gauge functional. We support three different 
kinds of gauge conditions and here we give the related gauge functionals and 
moreover a measure of the iteratively achieved gauge quality that can serve 
as a stopping criterion for the algorithm. 
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2.1.1. Coulomb and Landau gauge 

The continuum Landau gauge condition, Eq. (J3j), is fulfilled if and only if 
the lattice gauge functional 

*L*JV\ = j^fyKt tr l U9 M > ( 4 ) 

C ^ Mi 1 

resides in a stationary point with respect to gauge transformations g(x) G 
SU(A r c ). In the above equation we made use of the short hand notation 

U°(x)=g(x)U,(x)g(x + fiy. (5) 

Furthermore, with N c we denote the dimension of the gauge group SU(N C ), 
N c = 3 for QCD, Nd is the number of space-time dimensions, (Nd = 4 
for our work) and V is the number of lattice points. When switching to 
Coulomb gauge, all that changes is that the sum in Eq. (j3J) becomes limited 
to the spatial components of the Dirac index /i, thus leaving out the temporal 
one. Consequently, the optimization of Eq. (jlj) for Coulomb gauge can be 
performed independently on different time-slices. 

A measure 9 of how well the Landau/Coulomb gauge condition is satisfied 
on a given gauge field configuration is the average L 2 -norm of the gauge fixing 
violation A(x), i.e., the discrete derivative of the continuum gauge fields 

A(x) = (M*) ~ M* - A)) = o, (6) 

c X 

2.1.2. Maximally Abelian gauge 

The gauge functional for the maximally Abelian gauge is, in the case of 
SU(2), given by 

F^ AG2 [U] = ^ tr hU,(x)a 3 U,(x)^] (8) 

X,fl 

where ex 3 is the diagonal matrix of the three Pauli matrices that correspond 
to the generators of SU(2). Equivalently, in the case of SU(3) the gauge 
functional reads 

^MAGst^] = E tr [WIW) 1 ] + tr [A 8 ^(x)A 8 ^(x)t] (9) 

d x,n 
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where A3 and A§ build the Cartan subalgebra of SU(3). Maximizing Eq. (JH]) 
is equivalent to minimizing the off-diagonal components A^(x), i 7^ 3, 8 of 
the continuum gauge fields 

1 8 

A,{x) = -Y,hAf{x). (10) 

i=l 

Note that maximizing Eq. (JSJ) or Eq. (jUj), respectively, is equivalent to maxi- 
mizing the squares of the diagonal of each gauge link 



X,fl,t 



which is the gauge functional that we use in practice. 

When the SU(2) gauge functional Eq. (jHJ) is stationary with respect to 
gauge transformations, then the off-diagonal elements of 

X(x) = {U,{x)a z U^ + U,(x - tfa 3 U^x - p,)) (12) 



must vanish |22j. Thus, for SU(2) we can use 

e = wvJ2\( x ^\ 2 ( 13 ) 

c X 

as a measure of the gauge quality. The off-diagonal element (X(x))±2 reads 
explicitly 

(X(x))i2 = ^ 2(^o (x) tt^a (x) +« Ml i(a:)« Ml 3(ar) 

-^,0(^)^,1 (a?) - ^,2(^)^,3(2;) (14) 

+U tl fl{x - ft)u^ 2 (x - A) + ?V,l( X _ A)«M,3(a - A) 

+m M)0 (a; - p)u^i(x - p) - iu^x - p)u^ 3 (x - p)) 
where we adopted the Cayley-Klein parametrization 



7 



For SU(3), we use equivalently 

c X 

where the matrices X(x),Y(x), Z(x) G SU(2) stem from the three SU(2) 
subgroups of SU(3). 

2.2. Relaxation 

Now that we stated the optimization problem, we can proceed with pre- 
senting the algorithms which we will use to find a solution to the problem 
before we will discuss the implementation with CUDA in the next Section. 

The main idea of the relaxation algorithm is to sweep over the lattice site 
by site while optimizing the gauge functional locally. Thereby, as we will see 
below, can all sites of one of the two parity subsets (think of a checker board 
decomposition) be optimized at the same time since the newly generated 
local optimum is a function of the nearest neighbors only. 

In the following we will discuss the calculation of the local optimum sep- 
arately for Coulomb/Landau gauge and the maximally Abelian gauge. 

2.2.1. Coulomb and Landau gauge 

Instead of taking the complete global gauge functional into account, 

^Landau W] = 2jV NjV^ ^Landau ( X ) i (^) 

the relaxation algorithm aims at optimizing the value of F 9 [U] locally, i.e., 
for all x the maximum of 

/llndau^) = tr [ 9 (x)K(x)\ (18) 

is sought. Here we introduced 

K(x) := (U»{x)g{x + A) f + U^x - tfg(x - /}) f ) (19) 

where the sum runs over all space-time indices for Landau gauge and for 
Coulomb gauge it leaves out the temporal index. The local maximum thereof 
is, in the case of the gauge group SU(2), simply given by 

g{x) = K(x)t/y/detK(xy. (20) 
For the gauge group SU(3) (QCD) one iteratively operates in the three SU(2) 



subgroups [23| and thereby optimizes the local SU(3) gauge functional. 
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2.2.2. Maximally Abelian gauge 

Similarly as for the Coulomb and Landau gauges, the goal is to maximize 
the gauge functional Eq. (jSJ) locally. Again, we only need to know how to 
achieve this for SU(2) and then we can operate in the SU(2) subgroups of 
SU(3) for applications in QCD. 

Thus, for a given site x we want to maximize 



fuAGi( x ) = ^2^[^3g(x)U fl {x)a 3 U^{x)^g( 



X 



,t 



(21) 

+a 3 U^{x - p,yg(x)(7 3 g(x^U^(x - p.)] . 

Let us focus on the part of Eq. (I2~TT) with the up-going links only, i.e., the 
first term in the sum; the second term of Eq. (I2~TT) can be treated equivalently. 

For the following discussion it will be useful to switch to the Cayley-Klein 
parametrization of g(x) and U^x), 
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i=l 

and 



g = goi + l Yg l a i =( 9o + * 93 92 + igi ) (22) 



92 + Wi 9o ~ W3 



U — I + iu fM,3 M M,2 + M*a«.1 ] (23) 

respectively, where for a simpler notation we suppressed the space-time ar- 
gument x. 

Taking the fact that transformations proportional to a 3 leave the func- 
tional Eq. OH]) unchanged into account (thus setting g 3 = 0) one obtains 

1 2 / 2,2,2 2 \ 

<•))■ 

(24) 

Using a matrix/vector notation with g T 
written as 

f D 

/mag2( x ) = 2 fl ,T I E ~D \g (25) 








- U l,2 ~ 


H <a) ) ■ 


= (90 ; 


9u92) T 


E 




-D 


U 





—D/ 
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where we defined 

E = 2 ^2 ( -u m,o u m,i ~ u^u^s) (27) 
F = 2 {-u^ Q u^2 + u^u^) (28) 

whereby in D we used det = + u 2 ^ + + 3 = 1. 

Then the maximum of Eq. (|24|) is found when g is set to the eigenvector 
of the matrix of Eq. (1251) corresponding to the largest eigenvalue. The largest 
eigenvalue is A = \/D 2 + E 2 + F 2 and the corresponding eigenvector is 

(D + VD 2 + E 2 + F 2 , E , fY . (29) 



We refer the reader to [22| for more practical details related to the max- 
imally Abelian gauge. 

2.2.3. Overrelaxation 

In order to reduce the critical slowing down of the relaxation algorithm 
on large lattices, the authors of [l6[ suggested to apply an overrelaxation 
algorithm which replaces the local gauge transformation g(x) by g w (x), oj G 
[1, 2) in each step of the iteration. In practice the exponentiation of the gauge 
transformation will be done to first order. 

2.2.4- Microcanonical steps 

Applying a gauge transformation g ul (x) with u = 2 leaves the Lan- 
dau/Coulomb gauge functional invariant but these so-called microcanonical 
steps have the beneficial property to lead to a faster decorrelation and thus 
to faster convergence of the functional from which the simulated annealing 
algorithm will profit. 

2.2.5. Stochastic relaxation 

The stochastic relaxation algorithm replaces the local gauge update g(x) 
by a microcanonical step g 2 (x) with probability p and can lead to faster 
convergence on large lattices. 
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2.3. Simulated annealing 

Annealing is a method in condensed matter physics to bring certain ma- 
terials in their ground state by first heating them above their melting point 
and subsequently cooling them down very slowly. It is crucial hereby that 
the system is given enough time to thermalize at each temperature step. If 
so, the atoms will arrange themselves in such a way that the macroscopic 
system ends up in its - or at least close to its - lowest energy state. 

The authors of 0] developed an analogy of annealing and mathematical 
optimization problems. Following this analogy, the function which is to be 
optimized corresponds to the energy of the solid and the optimum is the 
ground state. 

The algorithm then simply performs local Metropolis updates where the 
acceptance probability of a random local gauge update g(x) is given by 

f 1 if f 9 {x) > f(x) 

P[9{X)] = jexp (fflzZM) else. (3 ° } 

Thus, while in a hot temperature regime, the algorithm accepts a worsening 
of the local gauge functional with a non-vanishing probability which ensures 
that the algorithm may overcome local extrema in order to increase the 
probability to find the global optimum. 

In practice, the Metropolis update gets replaced by heatbath updates that 
generate the new gauge transformation directly with the right Boltzmann like 
probability distribution^ In order to reach quicker thermalization at each 
temperature step, we perform three microcanonical steps after each change 
in temperature. Note that simulated annealing will never reach the required 
gauge precision 9 very accurately, instead relaxation or overrelaxation which 
can be regarded as simulated annealing in the limit of zero temperature, 
should be run subsequently to fully reach the required precision. See also 
Sec. 131 

2.4- Putting things together 

After we have listed the details of the underlying large scale optimization 
problem and the techniques to perform local optimizations, we are now in 
the position to consider the global optimization algorithm. 



2 We use the Philox RNG of the Randoml23 library [24| to generate random numbers 
in the heatbath kernel. 
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As mentioned before, due to the strict locality of the family of relax- 
ation algorithms and the simulated annealing algorithm, we can perform a 
checkerboard decomposition of the lattice and operate on all sites of one of 
the two sublattices even and oda% concurrently. All of the above mentioned 
algorithms have the same underlying structure which is depicted in Alg. [TJ 



Algorithm 1 

while precision not reached do 
for sublattice = even, odd do 
for all x of sublattice do 
for all SU(2) subgroups do 

local optimization: find g(x) G SU(2) Step 1. 

which is a function of C/ M (x) , U^(x — fi) 
for all ji do 

apply g(x) to U^x), U^x - p) Step 2. 

end for 
end for 
end for 
end for 
end while 



We want to stress that the difference of the various update algorithms as 
well as the difference between the gauges under consideration lies exclusively 
in Step 1 whereas, as we list explicitly in Appendix A the main work of the 
algorithm lies in Step 2 which is independent of the update type and of the 
target gauge. 



3. CUDA 

Here we briefly introduce the CUDA (Compute Unified Device Archi- 
tecture) programming model and summarize the hardware properties of the 
GPUs we adopt in our study. 

3.1. The programming model 

The CUDA model demands the division of the underlying problem into 
subproblems, so-called thread blocks, that can be treated independently from 



3 The sum over the space-time indices t+x+y+z determines whether a site is considered 
even or odd. 
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each other in parallel. These thread blocks, on the other hand, are ensembles 
of threads and the threads within a thread block may communicate with 
each other through shared memory. The independent thread blocks then 
form a so-called grid. This model is very flexible and allows the user to 
run a CUDA application on different hardwares (meaning different number 
of streaming multiprocessors (SMs) and CUDA cores) without the need for 
major adjustments^ This abstraction layer is introduced into the C language 
by defining a new set of functions which are called kernels and are identified 
by the __ global— declaration specifier. The kernel is executed N times where 

iV = block size x grid size (31) 

and each kernel call is a thread in the nomenclature introduced above. For 
the invocation of the kernel a new syntax is introduced where the block size 
and the total number of blocks (grid size) is specified. A unique index is 
given to each thread to assign, e.g., different memory addresses to different 
threads. 

A group of 32 threads (the number depends on the hardware generation) 
of the same block are tied together to what is called a warp. The operations 
of all threads within a warp are executed simultaneously as long as they 
follow the same instruction path. Otherwise, the operations become serialized 
resulting in up to 32 cycles instead of one, a warp divergence occurs. 

To efficiently hide memory latencies it is inevitable to have many warps 
active at the same time on a SM. The possible number of active blocks (or 
warps) depends on the available hardware that has to be divided among 
the threads, e.g., it depends on how many registers and how much shared 
memory is needed for an individual kernel. 

3.2. Memory Layout 

In the CUDA terminology the CPU on which the CUDA application is 
run is called the host, whereas the GPU is called device and the associated 
memory is called host and device memory, respectively. Communication be- 
tween host and device memory is the main bottleneck. Although, for many 
single GPU implementations, communication is only necessary in the be- 
ginning and in the end of an application. How one effectively can deal with 



Since we exclusively adopt devices of the Fermi generation, the characteristic of the 
SMs is always the same for our tests, see Sec. 13.31 
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compute capability 


2.0 


cores / SM 


32 per SM 


warp size 


32 


LI cache / SM 


16 KB or 48 KB 


shared memory / SM 


16 KB or 48 KB 


32-bit registers / SM 


32768 (32k) 


max. registers / thread 


63 



Tabic 1: Specifications of the Fermi architecture. 



communication from device to device through the host memory in multi-GPU 
simulations is discussed in Sec. [5j The part of device memory that is accessi- 
ble from the host as well as from all CUDA threads is called global memory. 
Global memory is allocated by a command in the host code. Each thread 
may then allocate its private local memory which resides in the same physical 
memory as global memory. Global and local memory are both cashed in a 
LI and L2 cache by default (for Fermi), on a cash miss the latency to device 
memory is very high. For most applications the bandwidth to device memory 
is another limiting factor, although it is large compared to a common CPU 
to RAM bandwidth. 

For communication within a block shared memory can be used. Shared 
memory has a very low latency since it resides in the same hardware as the 
LI cache. 

3.3. Hardware 

We adopt four different NVIDIA Fermi GPUs for our study, the GTX 480 
and GTX 580 from the consumer section and moreover the Quadro 4000 and 
the Tesla C2070 from the scientific/HPC section. The Tesla C2070, opposed 
to the consumer cards, supports ECC (error correcting code) protection for 
DRAM. Recently, the successor of the Fermi architecture has been released 
(Kepler). In Tab. [I] we give the data which is common to all Fermi GPUs, 
the hardware details of the individual devices are summarized in Tab. [2j 

4. Implementation details 

4-1. Code design 

The design goal of our code was the minimization of local memory usage. 
One of the main limiting factors of performance is the number of registers 
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GTX 480 


GTX 580 


Quadro 4000 


Tesla C2070 


graphics clock 


700 MHz 


772 MHz 


475 MHz 


575 MHz 


SMs 


15 


16 


8 


14 


total CUDA cores 


480 


512 


256 


448 


device memory 


1.5 GB 


1.5 GB 


2 GB 


6 GB 


memory bandw. 


177.4 GB/s 


192.4 GB/s 


89.6 GB/s 


144 GB/s 



Tabic 2: Hardware details of the Fermi devices that we adopt in this work. 

that are available per thread: on Fermi GPUs, the latter bound is 63 reg- 
isters of 32-bit each. If more variables (on the assembly level) are needed 
per thread, the registers are "spilled" to local memory. Local memory, as 
mentioned earlier, uses the same hardware as global memory and thus has 
the same (high) latency and bandwidth bounds. Besides register spilling an- 
other source of local memory usage may slow down the execution of a kernel: 
registers are not addressable and therefore will arrays generally be placed in 
local memory. In order to capacitate the compiler to place arrays in regis- 
ters, the size of the arrays and all index variables that access elements need 
to be computable at compile time^. Early versions of our code fulfilled this 
requirement by manually unrolling all loops and using C macros to access 
array elements. The present code, however, uses template parameters for 
the lattice dimensions and the dimension N c of the gauge group SU(A r c ). As 
a consequence, unrolling can perfectly be done by the compiler. This code 
design offers a very flexible setup for further lattice applications. 

4-2. Reduce memory transfers 

In order to reduce memory transfers between global memory and the ker- 
nel a 12 parameter representation of the SU(3) matrices has been suggested 



25] , i.e., only two rows of the matrix are stored and loaded. If we denote 
the first and the second row of the matrix with vectors u and v, respectively, 
then the third row is given by (m x d)*. The extra numerical work to re- 
construct the full matrix is hidden since our kernels are bound by memory 
transactions and not by floating point operations. This optimization reduces 
the number of bytes to load and store per site from 576 bytes to 384 in single 
precision. 



3 The latter statement implies that, for example, all for loops have to be unrolled. 
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4-2.1. Memory pattern 

Due to the hardware design of NVIDIA GPUs one has to adopt special 
memory layouts to efficiently utilize the memory bus to global memory. The 
peculiarity of these devices is that memory transactions of threads of the 
same warp are coalesced if they reside in the same 128-byte aligned seg- 
ment in global memory. Consequently, neighboring threads (i.e. neighboring 
sites) should access neighboring memory addresses to achieve high memory 
throughput. A natural memory layout where the gauge links (SU(3) matri- 
ces) are stored in one block in memory does not fulfill these requirements, 
hence the index order of the gauge fields in memory has to be adapted. 



The authors of [25( and [20[, e.g, use the native CUDA datatype Floaty 
and therefore distribute the 12 real numbers of a SU(3) element (in the 12 
parameter representation, see Sec. I4.2[) to three Floaty arrays. In contrast, 
we build on a more flexible way by employing one large float or double 
array, respectively, in combination with an access pattern class that hides 
the memory layout from the user. This strategy allows us to easily change 
the memory layout depending on the properties of the underlying application. 

Here we list explicitly the memory patterns that are in use in our gauge 
fixing applications, whereby the slowest running index is listed first: 

• StandardPattern (natural layout): t, x, y, z, /i, i, j, c 

• GpuPattern: /i, i, j, c, p, [t, x, y, z] 

• GpuPatternTimeslice: t, /i, i, j, c, p, [x, y, z] 

• GpuP atternP arityPriority: p, /i, i, j, c, [t, x, y, z] 

• GpuP atternTimesliceP arityPriority: t, p, /x, i, j, c, [x, y, z] 

where i,j G {0, 1,2} are the matrix indices, c identifies real (c = 0) and 
imaginary (c = 1) part of the complex number, \x G {0, . . . , 3} is the direction 
of the link, t is the index in temporal direction and x, y, z correspond to the 
spatial components. The index p G {0, 1} stands for parity (even and odd, 
respectively) and in those patterns where it is in use the space-time indices 
are split into two groups 

[t,x,y,z] p := {t,x,y, z\t + x + y + z mod 2 = p] (32) 

and equivalently for [x, y, z] . Parity splitting is necessary to achieve coa- 
lesced access to global memory, since we operate on the parity even and odd 
sublattices separately (see Alg. [TJ). 



16 



The GpuPattern is used in the single GPU implementations of Landau 
and maximally Abelian gauge. For Coulomb gauge we employ the GpuPat- 
ternTimeslice for the global gauge field array and the GpuPattern in kernels 
that operate on 1 x N% sublattices, i.e., within a single time-slice. To re- 
duce memory traffic between the nodes in the multi-GPU implementation 
we adopt the GpuPatternTimesliceParity Priority. This allows that only the 
active parity of the time-slices at the border can be transfered between nodes. 
All applications use the StandardPattern to read and write files with the nat- 
ural ordering. 

All patterns assume that the global array is allocated for full 18 parameter 
SU(3) links although the applications load and store only 12 parameters. 

4-2.2. Representation of the SU(3) link variables 

We define a template class SU3 with a template parameter that deter- 
mines the storage type. For matrices that reside in the global memory array 
we offer a class Link with three parameters: (1) the pointer to the global 
memory array, (2) a lattice site given in terms of an object of type Sitelndex 
and (3) the direction /x. No memory is allocated for SU3(Link) variables. 
For local matrices we offer the class Matrix which allocates local memory 
(or uses registers when possible) for matrix elements. Functions for copying 
between SU3{Link) and SU3{Matrix) are implemented, as well as functions 
to load only the first two rows (12 parameter representation) of the matrices 
as well clS cL function to restore the third row. 

4-3. The eight-threads-per-site strategy 

Within every iteration of the gauge fixing algorithms each site update 
needs its adjacent links. These are read from global memory and after the 
update they have to be written back to global memory. After having restored 
the third line, these eight SU(3) matrices per site equal 8 x 18 reals = 144 
reals and therewith exceed the register limit of 63 per thread what results in 
register spills to global memory and as a consequence negatively effects the 
bandwidth bound performance of the kernel. 

With the purpose of reducing register spills, we switch to a finer paral- 
lelization granularity: instead of assigning one thread to one lattice site we 
now tie eight threads to a single lattice site, i.e., one thread for each of the 
eight matrices that are involved in a site update. As a result, each thread 
needs only 18 registers to store the gauge link. 
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In order to avoid warp divergences the kernel is invoked with a thread 
block size of 8 x 32 = 256. By doing so, each of the eight warps takes care of 
one neighbor type of the 32 sites and thus all threads within one warp follow 
the same instruction path. 

The gauge transformation is then accumulated in shared memory. Since 
one operates on the SU(2) subgroups of SU(3) and an SU(2) matrix can 
conveniently be represented by four reals, this requires 4 x 32 = 128 reals 
or 512 bytes (SP) or 1024 bytes (DP) per thread block. To avoid race con- 
ditions on the shared array the accumulation is done using the atomic^add 
function in single precision and by explicit serialization using __syncthreads() 
in combination with ^/-statements in double precision^]. 

The benefit of this strategy is that, in single precision, no register spillings 
occur at all if no further constraints on the kernel are applied (see Sec. 14. 4p 
and for double precision, register spills are drastically reduced. The drawback 
of the current implementation compared to a more conventional one-thread- 
per-site strategy is that the block size is increased and thus the occupancy per 
site of the multiprocessors is decreased, i.e., less sites are computed concur- 
rently. Nevertheless does this strategy result in a clear overall performance 
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4-4- Optimizations 

Besides the aforementioned algorithmic optimizations we further tuned 
our code by optimizing the CUDA settings. 

First of all we set launch bounds to individual kernels: by specifying the 
number of threads per block and a minimum of active blocks a bound on the 
maximal register usage is given. Without launch bounds the compiler uses 
45 registers in the overrelaxation kernel for Landau and Coulomb gauge, 
resulting in a theoretical occupancy of 42%. By setting the register limit 
to 32 the theoretical occupancy is increased to 67% on the cost of a small 
amount of register spilling (24 byte stack frame, 24 byte spill stores, 40 byte 
spill loads) The same settings are applied to the other gauge fixing kernels. 

Fermi devices have a LI cache that physically shares the same 64 kB 
hardware (per SM) with shared memory. The size of the LI cache and shared 



6 atomic-add is not supported for datatype double. 

7 The given values for register usage and spilling are for CUDA Toolkit 5.0 compiled 
for compute capability 2.0. They vary between different between CUDA 4.x and 5.0 but 
the optimal launch bounds arc found to be the same. 



18 



memory can be set by the user for each kernel. Since we only need 512 Byte 
shared memory per block and a maximum of 4 blocks is possible, we only 
need a total of 2 kB shared memory per SM. This allows us to set the kernel 
to a prefer LI cache configuration which means 16 kB shared memory and 
48 kB of LI cache. With this setting the register spilling introduced by the 
launch bounds is cached more efficiently 

By a global compiler switch, the use of LI cache can be set to either 
caching (default) or non-caching (-Xptxas -dlcm=cg) loads. By using non- 
caching loads our applications shows a small improvement in performance. 
This is due to the fact that the use of global memory is designed such that 
only in the beginning of each kernel the matrices are loaded to local memory 
(i.e., into registers) and after all operations are finished they are written back. 
In between there is no reuse of cached data and thus there is no benefit in 
caching at all. With non-caching load the LI cache is solely used for register 
spilling and write-backs of register spills to the device memory are reduced 
or totally removed. 

In all applications we compile with the use-fast-math switch. Single pre- 
cision operations are then replaced by faster implementations on the expense 
of precision though we did not experience any effects by this setting. For 
double precision operations there is no such option. 

4-5. Numerical accuracy 

In the following we investigate the accumulation of numerical rounding 
errors within our lattice gauge fixing applications. A suitable measure is the 
conservation of unitarity of the SU(3) matrices during the progress of the 
algorithm through many iterations. In Fig. [1] we show 



from a run over 12000 iterations of the overrelaxation update on a 32 4 lat- 
tice in single (SP) and double (DP) floating point precision. Moreover the 
plot shows lines corresponding to a mixed precision (MP) ansatz which cal- 
culates the overrelaxation gauge update on the SP gauge fields in full DP 
(see Fig. lA.lip while the less precision demanding application of the gauge 
transformation to the links (Step 2 in Alg. [TJ) is performed in SP. 

In DP, both, the average and even the maximal value stay well below 
10~ 12 whereas in SP the error accumulates to the order 10~ 3 . To overcome 
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Figure 1: Conservation of unitarity (|1 — det(Z7)|) in SP, MP and DP 

the loss of unitarity, one may use the unitarity as a constraint and thus 
reproject the links to SU(3) after a given number of iteration steps. 

The peaks in the SP maximum lines are individual outliers that occur 
approximately every 1000 iterations in one of the links of a 32 4 lattice on our 
GTX 580, whereas they could not be detected on the Quadro 4000. 

Whether or not the loss of high precision unitarity in SP is of significance, 
depends of course on the individual problem the code is applied to. In Fig. [2] 
we show the value of the Landau gauge functional which is the sensitive quan- 
tity in our applications, in different precisions, again over 12000 iterations^) on 
a 32 4 lattice. It becomes obvious that SP without reprojection is not a good 
choice for lattice gauge fixing since the value of F 9 even starts to decrease af- 
ter around 3000 iterations. The DP functional value increases monotonically 
and finally reaches a plateau, this fact together with the previous mentioned 
maintenance of high precision unitarity lets us conclude that a DP simula- 
tion, even without reprojection, is very accurate. Thus, we can use the DP 
value as a benchmark for the other approaches. In the inner plot of Fig. [2] 
we show the relative deviation of each curve to the final DP result: SP with 
reprojecting to unitarity after every 100 steps and MP without reprojection 



The gauge precision thereafter was 8 < 6.0 x 10 11 for the run in DP. 
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Figure 2: The value of the Landau gauge functional Fl &ndau [U] as a function of the number 
of iterations of the overrelaxation kernel in single (SP), mixed (MP) and double precision 
(DP). In addition, the evolution of the functional value is shown when a reprojection 
to SU(3) is done every 100 iteration steps in SP and MP. The inner plot gives relative 
deviation of all curves (except SP without reprojection) from the final functional value in 
DP. 



stay within a relative deviation of 2 x 10" 5 and MP with reprojection even 
within 5 x 10~ 6 . Moreover, the MP line shares the same qualitative behavior 
as the DP curve (monotonicity, convergence to a constant). 

Therefore, our conclusion for the required floating point precision in lat- 
tice gauge fixing is as follows: in case one is primarily interested to actually 
fix the gauge of a gauge field configuration without being interested in the 
precise value of the resulting gauge functional, SP with reprojection is fine. 
If it is required to obtain the gauge functional value within a precision of 
no more than 10 -5 , MP with reprojection is recommended since it retains 
most of the SP performance, as we will show in Sec. [6j opposed to DP which 
should only be chosen when one depends on a high precision result in the 
value of the gauge functional. 

5. Multi-GPU 

In the following discussion we will replace the space-time argument x = 
(x, t) by the time argument t alone wherever the x dependence is of no 
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significance in the given context. Moreover, we will assume that one MPI 
process is assigned to one GPU device and thus use the terms process and 
device interchangeably. 

In order to share the work which has to be performed locally on each 
lattice site, between several processes, we adopt a straightforward domain 
decomposition: we split the lattice of size N% x N t into N t /nprocs partitions, 
where nprocs denotes the number of processes involved in the parallelization. 
For Coulomb gauge this splitting is trivial since, as we discussed above, we 
can operate on the different time-slices separately and only need to apply 
the final gauge transformation g(t) of the time-slice U^{t) to the temporal 
components of the preceding time-slice U^t — 1). This makes on-the-fly 
communication between devices for Coulomb gauge fixing unnecessary. 

Manifestly covariant gauges like the Landau gauge and the maximally 
Abelian gauge, on the other hand, are more subtle. Here, all four neighbor- 
ing links in the negative /i-direction have to be collected on each site x in 
order to calculate the gauge update g(x) which subsequently is applied to 
all the eight links connected to the site x. Thus, with the ansatz of splitting 
the lattice across the temporal direction, we have to exchange the temporal 
components Uq{x) of the gauge fields on time-slices that lie at the boundary 
of two processes. 

5.1. Data exchange between neighboring devices 

If we label the minimum time-slice that resides on a given device with 
£ m in and the maximum time-slice with t maX) then only the calculation of the 
local gauge transformations g(t m i n ) depends on the data exchange between 
different processes since for its calculation the gauge links U^(t m i n — 1) that 
reside on the neighbor process are needed. Note that since we operate on the 
parity even and odd lattice sites consecutively, the currently active parity 
of the time-slice U^(t msix ) is completely unaffected by the exchange with the 
neighboring process that only touches the passive parity part of U^tm^). 
That means, on a given process, all time-slices except t m i n can be updated 
without exchanging any information with the neighbor processes. In order to 
update the t/ M (t m i n ) on all devices, however, the following set of instructions 
has to be carried out on each device in order to transfer the links U (t max ) 
of device i to device 

1. cudaMemcpyDeviceToHost of U (t max ) (inactive parity) 

2. MPLSend of U (t max ) to device % + 1 and MPLRecv of U (t min - 1) 
from device % — 1 
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3. cudaMemcpyHostTo Device of U (t min — 1) 

4. update i7 M (i m i n ) (active parity) which affects U (t min — 1) (inactive) 

5. cudaMemcpyDeviceToHost of Uo(t m i n — 1) (inactive parity) 

6. MPPSend of £/o(^min — 1) to device i and MPLRecv of Uo(t ma , x ) from 
device i + 1 

7. cudaMemcpyHostToDevice of t/o(£ m ax) 

5.£. Date pattern 

The memory pattern GpuPatternTimesliceParityPriority, introduced in 
Sec. I4.2.1[ will De the pattern of choice for applications that get accelerated 
by a time-slice splitted multi-GPU approach. Not only is the time-index 
running slowest and thus allows to handle different time-slices separately 
in the latter mentioned pattern, moreover the time-slice internal pattern is 
very advantageous: each time-slice is split into its two parity parts of which 
each has the Dirac index /i running fastest, followed by the row index of the 
individual gauge matrices. 

This layout ensures that the data which has to be exchanged, the first two 
rows (12 parameter representation) of the link variables Z7o(t m in) of a given 
parity, lie contiguous in device memory. The size of the data block that has 
to be exchanged is then given by the size of a time-slice multiplied by 1/2 
(parity), 1/4 (Dirac index) and 2/3 (12 parameter representation), thus 1/12 
in total. 

5.3. Asynchronous memory transfers 

We target at hiding the data exchange between different devices by over- 
lapping them with calculations on the unaffected time-slices. Replacing the 
CUDA function cudaMemcpy with cudaMemcpyAsync results in a non block- 
ing copying process from host to device or vice versa. Making use of different 
cudaStreams a device can then perform a copying request and execute a ker- 
nel at the same time. 

In order to investigate how many time-slices are needed per device to 
fully hide the data exchange between two devices, we measured the time for 
the execution of the overrelaxation kernel on one time-slice and the time for 
a transfer of 1/12 of a time-slice for different spatial lattice sizes Nf and 
averaged the result over 1000 iterations, see Tab. [3j As we can read off 
from the table, the asynchrounous kernel execution on two time-slices takes 
longer then a device to host or host to device copy process, respectively. As 
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N 3 

s 


D2H [/is] 


H2D [us] 


kernel [/is] 


D2H/kernel 


H2D/kernel 


16 


0.0398 


0.0368 


0.0209 


1.90 


1.76 


32 


0.2543 


0.2276 


0.1443 


1.76 


1.58 


64 


1.2510 


1.1830 


1.0489 


1.19 


1.13 


128 


8.9597 


8.7169 


8.3041 


1.08 


1.05 



Table 3: The time needed to copy the relevant part (1/12) of a time-slice from device to 
host (D2H) and host to device (H2D) compare to the time needed to update one time-slice 
with the overrelaxation kernel (all in /xs) averaged over 1000 iterations for different spatial 
volumes Nf. The two most right columns give the ratios. 



discussed above, the necessary data exchange between two devices includes in 
total four such copy processes and thus eight time-slices are enough to reach 
a complete overlap of data exchange from device to host (host to device) and 
calculations in the inner part of the domain. 

So far we neglected the data exchange via MPI between the two neighbor- 
ing host processes. As for the data exchange between host and device, here 
again it is advantageous to use non blocking functions for the data exchange, 
i.e., MPIJsend and MPIJrecv. By doing so we can again overlap the data 
exchange between the processes by calculations on time-slices that are not 
involved in the exchange. 

In practice, we implemented the overlap of calculations with the data 
exchange between the processes and between host and device as a method 
of a communicator class. Then we only have to set up a certain update 
type (overrelaxation, simulated annealing etc.) and the apply method of the 
communicator object applies that update including full overlap with the data 
exchange. 

6. Results 

In this Section, we firstly examine the performance of the code on various 
devices including multiple GPUs. There, we pick the Landau gauge overre- 
laxation kernel as a representative for all kernels and gauges. Secondly, we 
outline a few sample results obtained by the application of our lattice gauge 
fixing code. 

6.1. Performance on single GPUs 

In Fig. [3] we show the performance of the overrelaxation kernel on the 
GTX 580 for different spatial volumes as a function of the temporal lattice 
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extent. The data stems from an average of one hundred repeated applications 
with 1000 iterations each. We achieve up to 370 GFlops in SP, up to 300 
GFlops in MP and 80 GFlops in DP. The max. performance of 370 GFlops 
corresponds to an execution time of 6.4 s with the given lattice size and 
number of iterations. For the smaller lattices the theoretical occupancy of 
the device is not reached and therefore the maximum performance is not 
achieved. Apart from that, we find almost constant performance for all 
lattice volumes. 

In Fig. |4] we compare the performance of different Fermi devices on lat- 
tices of size 32 4 . Our top performers in SP are the GTX 580 with nearly 
370 GFlops, followed by the GTX 480 at around 300 GFlops. The difference 
between these devices results from the reduction in chip clock, number of 
SMs and bandwidth. The scientific GPUs are designed for a longer runtime 
and therefore the chip clock is remarkably lower. Thus, the performance 
of the C2070 is only close to 200 GFlops, the Quadro 4000 is at around 
120 GFlops. Noteworthy is the difference between single and double preci- 
sion: the theoretical ratio of SP to DP arithmetic operations for the scientific 
devices is 1:2, whereas the consumer GPUs have a ratio of 1:8. Accordingly, 
the performance ranking changes: still the GTX 580 performs best with ap- 
proximately 80 GFlops, now followed by the C2070, slightly faster than the 
GTX 480 at around 70 GFlops. Thus, even for the scientific GPUs the the- 
oretical factor of a half compared to SP could not be reached. The reason 
is that approximately twice as many registers are needed in DP and there- 
fore even for the maximum of 63 registers spilling occurs. Additionally, the 
theoretical occupancy is reduced by the increase in registers. 

6.2. Performance on multi-GPUs 

Our multi-GPU performance tests have been carried out on the "mephisto" 
cluster at the University of Graz. The cluster provides five compute nodes 
with four NVIDIA Tesla C2070 GPUs and CUDA 5.0. Moreover, each node 
offers two Intel Xeon Six-Core CPUs X5650 ("Westmere") @ 2.67GHz on 
each node. The nodes are connected via InfiniBand and OpenMPI 1.4.3 and 
CUDA 5.0 is installed. 

In the plot of Fig. [5] we show that linear weak scaling is reached with this 
strategy. The test have been performed on lattices of size 64 3 x 32 per GPU 
(64 3 x 512 in total with 16 GPUs) and 48 4 per GPU (48 3 x 768 in total with 
16 GPUs). The higher performance of the spatial volume of 64 3 is simply 
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Figure 3: Performance for different spatial volumes as a function of the temporal lattice 
extend in SP, MP and DP on a GTX 580. 



due to higher occupancy: since we operate on single time-slices at a time, 
the lattice of spatial size 48 3 is not sufficient to efficiently occupy the device. 

In Fig. [6] strong scaling is tested. For a total lattice size of 64 3 x 256 we find 
close to linear strong scaling up to 16 GPUs which corresponds to 16 time- 
slices per device. On a lattice of size 64 3 x 128 we find for 16 GPUs (eight 
time-slices per device) a performance loss of 15-30% (DP vs. SP). When 
moving on to a smaller temporal lattice extent, N t = 96, the performance 
decreases further. Moreover, for this lattice size, no gain in performance is 
apparent when adopting 16 instead of 12 devices. 

6.3. Comparison to existing CPU code 

Lastly, we compare our performance to the overrelaxation kernel of the 



FermiQCD library 26]. The FermiQCD toolbox is open source (C++) and 
has been designed to be easy to use while at the same time offering the user 
many applications for lattice QCD, in some applications at the expense of 
performance. To our knowledge, it is the only publicly available code that 
supports lattice gauge fixing with the overrelaxation algorithm in Landau 
gauge. We would be happy to compare our code to a wider range of imple- 
mentations. 
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Figure 4: Performance of the Landau overrelaxation kernel on different NVIDIA devices 
in single (SP), mixed (MP) and double precision (DP) on a 32 4 lattice. 

As test bed we chose an Intel Xeon Westmere CPU on the mephisto 
cluster, see Sec. 16.21 We run the FermiQCD Landau gauge overrelaxation 
kernel in SP on a lattice of size 32 4 on a single core in avoidance to reflect 
parallelization artifacts. Then we compare the performance to our code (same 
lattice size and precision) from the Tesla C2070 that the cluster offers. 

FermiQCD reaches a performance of 0.414 GFlops and our code reaches 
for this setup 195.08 GFlops. Thus, our implementation executed on the 
Tesla GPU is equivalent to FermiQCD executed on « 470 CPU cores of the 
given type, under the naive assumption of linear scaling for the CPU code. 

6.4- Temperature dependence of the simulated annealing algorithm 

In Sec. 12.31 we discussed the importance of keeping the temperature gra- 
dient small in the simulated annealing. Therefore, it is crucial to set up the 
right temperature interval in order not to waste many iteration steps in a 
temperature region where the gauge functional is insensitive to. 

In Fig. [7] we show an example of the evolution of the gauge functional 
F 9 [U] and the gauge precision 9 of the Landau gauge and the maximally 
Abelian gauge. The simulation has been performed on a hot gauge field, i.e., 
having all gauge links set to random SU(3) matrices. The lattice size is 32 4 
and for both cases 10,000 simulated annealing steps have been carried out. 

As one can read of from the plot, in this case, the sensitive region where 
the gauge functional changes most is for Landau gauge below T < 4 and for 
the maximally Abelian gauge slightly lower, T < 2. 
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Figure 5: Weak scaling on the Tesla C2070 in single (SP), mixed (MP) and double precision 
(DP). The full symbols correspond to a lattice size of 64 3 x 32 per GPU and the open 
symbols to 48 4 per GPU. 

6.5. Cooling down to maximally Abelian gauge 

Here, we aim at reducing the time and number of iterations to gauge fix a 
configuration to the maximally Abelian gauge. We test overrelaxation versus 
a combination of simulated annealing, stochastic relaxation and overrelax- 
ation in terms of required number of iterations to gauge fix a sample gauge 
configuration with inverse coupling (3 = 5.7 and lattice size 32 4 . 

Both approaches use an overrelaxation parameter of u = 1.35, the sec- 
ond method starts off by applying 2000 simulated annealing steps including 
three microcanonical updates after each step (i.e., 8000 steps in total). Sub- 
sequently, a maximum of 2000 stochastic relaxation steps are applied and 
lastly overrelaxation until the precision 9 < 10~ 12 is reached. Method one 
directly applies the overrelaxation kernel until convergence. 

We started both variants on 100 randomly chosen points on the gauge 
orbit. Method one succeeded to find an optimum for 84 out of the 100 copies, 
the remaining 16 got stuck at a value of 9 « 10~ 7 until the algorithm was 
stopped after one hundred thousand iterations. Method two was successful 
for 97 copies. 

The average number of required iterations (the combined number of all 
updates) is given in Tab. HI together with the final value of the gauge func- 



28 




1 1 1 1 1 

4 8 12 16 

Number of GPUs 



Figure 6: Strong scaling on the Tesla C2070. The spatial lattice volume is kept fixed at 
64 3 and the total temporal extent varies for the three lines (per precision) from the top 
downwards N t = 256, 128, 96. 





OR 


SA/SR/OR 


# of converged copies 


83 


97 


$ of iterations 


272340 ± 8405 


16701 ± 2562 


F 9 [U] 


0.74356431(39697) 


0.74423815(10996) 



Table 4: Comparing the application of the overrelaxation algorithm (OR) solely, to the 
subsequent application of simulated annealing (SA) with microcanonical steps, stochastic 
relaxation (SR) and OR on 100 copies of a gauge field of lattice size 32 4 . 

tional F 9 [U}. 

As it is evidence from the data, the combined approach of simulated an- 
nealing, stochastic relaxation and overrelaxation outperforms the pure over- 
relaxation method in terms of number of iterations by a factor of almost two 
and moreover reaches an higher average value of the gauge functional while 
bringing more gauge copies to converge. The average time spend by the de- 
vice (GTX 580) per gauge copy was four minutes for method two and slightly 
below seven minutes for method one. It has to be stressed, however, that 
not all gauge copies converged and hence these copies enter the average of 
the execution time with a biased weight since the kernel was executed until 
the maximum number of iterations was reached. 
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Figure 7: The temperature dependence in simulated annealing of the gauge functional 
F 9 [U] and the gauge precision 6 of the Landau gauge and the maximally Abelian gauge. 



6.6. Towards the global maximum of the Landau gauge functional 

We take the same gauge field configuration with /3 = 5.7 and lattice size 
32 4 of the previous subsection and now aim at finding Landau gauge Gribov 
copies with gauge functional values as high as possible. Three runs with 100 
random starts on the gauge orbit have been performed. The difference of the 
three runs lies in the number of simulated annealing steps that are applied 
before the overrelaxation kernel takes over. We apply zero, three thousand 
or ten thousand simulated annealing steps, respectively. The temperature 
has been decreased from 4 down to 10 -4 . Each simulated annealing step 
is followed by three microcanonical updates. Subsequently, we apply the 
overrelaxation kernel until 9 < 10 -10 . 
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Figure 8: The relative deviation from the maximal gauge functional. From left to right 
with 10000, 3000 and zero simulated annealing steps. 



We determined the maximum gauge functional value of all the runs, which 
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we denote by -F max and define the relative deviation from it by 
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The latter is plotted in histograms in Fig. [8] for all the three runs. The plot 
clearly demonstrates how the application of simulated annealing increases the 
chance to find the global maximum, especially on a relatively large lattice of 
size 32 4 . This test has been performed in parallel on two Tesla C2070 devices 
within several hours. 

7. Summary 

We presented a CUDA implementation for gauge fixing in lattice gauge 
field theories based on the relaxation algorithms. The code is based on the 
cuLGT package^ and supports the Landau, Coulomb and maximally Abelian 
gauge fixing conditions. 

The implementation and the various optimization strategies have been 
discussed in detail. We showed that simulated annealing and overrelaxation 
can heavily be accelerated by employing GPUs. We listed convergence re- 
sults in different floating point precisions and concluded that a mixed preci- 
sion ansatz that performs only the critical parts of the simulation in double 
precision is a good compromise in terms of precision (~ 10~ 5 relative to DP) 
and performance (80% — 90% of SP). 

A maximum sustained performance of 370 GFlops on a single GTX 580 
has been reached and moreover linear scaling on 16 Tesla cards with 3.5 
Teraflops, given that the number of time-slices per device does not deceed 



Lastly, we demonstrated how the combination of simulated annealing 
and the various relaxation flavors can be tuned in such a way that either fast 
convergence to the gauge of choice is reached or alternatively that a gauge 
functional value as high as possible is obtained. 

We are currently preparing tests on the Kepler architecture, updates on 
Kepler performance will be available on our homepage shortly. 



9 Both is available for downloaded under www.cuLGT.com 
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Appendix A. Counting flops 

As we discussed in Sec. I2.4[ the main work of Alg. [1] consists of applying 
the new update g(x) to the neighboring links of site x, i.e., Step 2 of the 
algorithm. We will now analyze this more quantitatively. In Fig. IA.9l we show 
the code snipplet of cuLGT for the multiplication of a SU(3) matrix with a 
SU(2) subgroup element from the left. Here, the SU(2) subgroup element 
is stored as a an object of class Quaternion (Cayley-Klein four parameter 
representation). 

template<class Type> 

void SU3<Type> : : lef tSubgroupMult ( lat_group_dim_t i, 

lat_group_dim_t j , Quaternion<Real> *q ) 

{ 

for( lat_group_dim_t k = 0; k < 3; k++ ) 
{ 

Complex<Real> IK = q->get( 0, ) * get(i,k); 
IK += q->get( 0, 1 ) * get(j.k); 

Complex<Real> JK = q->get( 1, ) * get(i,k); 
JK += q->get(l,l) * get(j.k); 

set( i, k , IK ) ; 
set ( j , k, JK ) ; 

} 

} 

Figure A. 9: Muliplication of a SU(3) matrix by a SU(2) subgroup element in Quaternion 
representation from the left. The total number of flop is 84 per SU(2) subgroup iteration, 
sec discussion in the text. 
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As we can read of from the figure, in the loop over k we encounter four 
complex multiplications (six flop each) plus two complex additions (two flop 
each), thus 28 ■ 3 = 84 flop for the update of U^x) and equivalently for 
Ufj,(x — jX). Consequently, the number of flop for Step 2, in four dimensions, 
sums up to 84 • 2 ■ 4 = 672 per lattice site and SU(2) subgroup and hence to 
672 ■ 3 = 2016 for SU(3). 

As mentioned before, the above part is the same for all gauges and all 
update types. Only Step 1 of Alg. [I] distinguishes between different gauges 
and update types. Let's consider for example an overrelaxation update for 
Landau gauge. The latter consists of calculating g(x) according to Eq. (fl9l 
plus a first order approximation of the exponentiation g(x) — > g u {x). In 
the cuLGT code, the sum of Eq. (1191) is done on the Quaternion objects. 
Extracting the four reals of Quaternion representation of a SU(2) subgroup 
element of SU(3) requires four flop, see Fig. IA. 101 The Quaternion objects 
are then gathered in an array in shared memory (shA) according to Eq. (fT9]) . 
This means four flop (four additions) for each Quaternion. Thus the number 
of flop in Eq. f[T9"j) is eight per link variable and in 4D eight link variables are 
involved, i.e., 64 flop per lattice site and SU(2) subgroup iteration or 192 for 
SU(3). 

Subsequently, the overrelaxation update g(x) — > g u (x) is calculated. 
Counting each operation in Fig. I A. 1 1 1 as one floating point operation (rsqrt 
corresponds to two operations), the effective number of flop for the overre- 
laxation update is 22 per lattice site and SU(2) subgroup, thus 66 for SU(3). 

Summing up, the overrelaxation algorithm in SU(3) for Landau gauge 
requires 

• 192 flop to gather the neighboring links U^x), U^(x — /2), 

• 66 flop for the overrelaxation update, 

• 2016 flop to apply the new g(x) to U^(x), U^(x — (1) 

and thus in total 2274 flop/site. Note that we do not take the extra Flops 
for the reconstruction of the third row of the SU(3) matrices into account. 

For the heatbath kernel of the simulated annealing algorithm the number 
of flops cannot be calculated correctly because of the non-deterministic loops 
with random-number-dependent termination conditions. We counted the 
flops as if every loop is only run once and each RNG call is counted as one flop. 
Both choices are very conservative. Therefore, a comparison of simulated 
annealing implementations should be based on pure time measurements. 
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template<class Type> 

Quaternion<Real> SU3<Type>: :getSubgroupQuaternion( 

lat_group_dim_t iSub, lat_group_dim_t jSub ) 

{ 

Quaternion<Real> q; 

Complex<Real> temp; 

temp = mat .get (iSub, iSub) ; 

q[0] = temp.x; 

q[3] = temp.y; 

temp = mat. get (j Sub, j Sub) ; 

q[0] += temp.x; 

q[3] -= temp.y; 

temp = mat .get(iSub, jSub) ; 

q[2] = temp.x; 

q[l] = temp.y; 

temp = mat .get ( jSub, iSub) ; 

q[2] -= temp.x; 

q[l] += temp.y; 

return q; 



Figure A. 10: Extracting a SU(2) subgroup element of SU(3) in Quaternion representation. 
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void OrUpdate: : calculateUpdate ( volatile Real (&shA) [4*NSB] , 

short id ) 

{ 

Real ai_sq = shA[id+NSB] * shA[id+NSB] 

+shA[id+2*NSB] * shA [id+2*NSB] 
+shA[id+3*NSB] * shA [id+3*NSB] ; 



Real aO_sq = shA [id] * shA [id] ; 



Real b = (orParameter*aO_sq + ai_sq)/(aO_sq + ai_sq) ; 
Real c = rsqrt( aO_sq + b*b*ai_sq ); 



shA [id] *= c; 

shA[id+NSB] *= b*c 

shA [id+2*NSB] *= b*c 

shA[id+3*NSB] *= b*c 



Figure A.ll: The overrelaxation update requires 22 flop per lattice site and SU(2) sub- 
group. 
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